-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TEP-0091] Add Verification at reconciler #5581
[TEP-0091] Add Verification at reconciler #5581
Conversation
Skipping CI for Draft Pull Request. |
a89e323
to
00791e8
Compare
/assign @wlynch |
The following is the coverage report on the affected files.
|
/assign @dibyom |
cmd/sign/main.go
Outdated
} | ||
} | ||
if *kmsKey != "" { | ||
signer, err = kms.Get(ctx, *kmsKey, crypto.SHA256) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/s KMS doesn't work quite the same way as cosign since it's trying to minimize deps - you need to import all of the desired providers from https://github.com/sigstore/sigstore/tree/main/pkg/signature/kms manually, which will call ProviderInit to register the prefix.
The reason for this is that the providers are a big source of the massive dependency creep in cosign and related tools, since the cloud providers will pull in their corresponding SDKs.
I don't think we want to put this dependency burden on pipelines - at the very least you should make this its own module, though you may find it easier to just break this out into its own repo in the tektoncd org.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Followed up off-thread - KMS dependency is likely unavoidable since we'll need it on the reconciler side anyway.
Import away 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would still recommend a separate repo for this long term though - you'll likely find it easier to maintain / distribute. Process isn't too bad - just follow the process here: https://github.com/tektoncd/community/blob/main/process.md#proposing-projects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I remove those deps from this PR? I just tried to push them in a new commit and added 13k lines of code. 😢
Then this commit doesn't really work for kms. (Can be added later when we have a separate repo)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that sounds about right. It's kinda unavoidable given that you're pulling in the SDKs.
cmd/verify/main.go
Outdated
limitations under the License. | ||
*/ | ||
|
||
package main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of making 2 separate binaries, start prepping this to be a tkn plugin by making this 1 single binary with subcommands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok to keep these 2 binaries before tkn support? Then we depreciate/remove them.
The tkn side may take longer time, including TEP and implementation. Just worried we have no way to sign files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need a TEP for a tkn plugin - you just have to install the binary with the name tkn-<plugin name>
https://github.com/tektoncd/cli/blob/78665b7ea7583a2e76d85d5996924d5bd5ebad82/pkg/plugins/plugins.go#L34-L50
You could still do this with 2 binaries if you wanted, though it becomes a little more annoying to distribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add these into tkn as subcommand. Created an issue here tektoncd/cli#1733.
So for this commit in the doc I will point to experimental for signing, and when tkn supports we can update the doc
The following is the coverage report on the affected files.
|
The following is the coverage report on the affected files.
|
The following is the coverage report on the affected files.
|
The following is the coverage report on the affected files.
|
Why do we need these new /hold |
Sure, I will remove them and add to tkn |
cbc10a9
to
0a5132c
Compare
The following is the coverage report on the affected files.
|
/hold cancel Thanks! Also, looks like you need to do a |
c2664cc
to
75d8cfe
Compare
The following is the coverage report on the affected files.
|
The following is the coverage report on the affected files.
|
a3e76a8
to
8e8fc96
Compare
The following is the coverage report on the affected files.
|
8e8fc96
to
e00337c
Compare
The following is the coverage report on the affected files.
|
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for adding e2e tests for this!!
I wonder if it would make sense to reduce the number of e2e test cases? (Because e2e tests are a bit slow and also tend to be difficult to maintain)
The test cases we have are:
- "Signed Task Passes Verification",
- "Tampered Task Fails Verification with tampered content",
- "Unsigned Task Fails Verification without signature",
- "Signed Pipeline Passes Verification",
- "Tampered Pipeline Fails Verification with tampered content",
- "Unsigned Pipeline Fails Verification without signature",
I think it would be worth thinking about:
- What would cause one of these tests to fail that wouldn't be caught by one of the other e2e test cases?
- How many of these are directly duplicating test cases we have at the reconciler and unit test level?
Thinking about (1) I'm wondering if we need to cover both Pipelines and Tasks at this level - feel free to push back on this b/c maybe the logic is significantly different but I'm wondering if the logic is similar enough that if there was a problem, both the Pipeline and Task tests would fail (i.e. we only need one of the two sets)
Thinking about (2) I'm wondering if we need to test the error cases at all at this level (or the reverse? maybe we only test that tampered tasks fail) - since the main goal of e2e tests is to make sure everything is wired up properly and we want to leave the specifics to other (faster) tests, maybe we can get away with less here
^^ But take that all with a grain of salt! Maybe it's worth having 6 separate tests here. Or maybe we can just make the Pipeline vs Task optimization and stick with 3. Either way thanks again for writing these!
) | ||
|
||
func init() { | ||
os.Setenv("PRIVATE_PASSWORD", password) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this environment variable just for the test or part of the feature? Either way could we document it somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh sorry, this is only for the test. I will add comment for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added comment
test/trusted_resources_test.go
Outdated
ctx := context.Background() | ||
ctx, cancel := context.WithCancel(ctx) | ||
defer cancel() | ||
c, namespace := setup(ctx, t, requireAnyGate(neededFeatureFlags)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apologies for the potential scope creep here, but i want to double check that setting these flags as part of this test won't impact any other tests, do you know if that's the case? i'm wondering specifically two things:
- do we run the e2e tests in parallel? if so, do we do anything to make sure that other tests aren't impacted by controller wide config settings like this?
- if the test gets interrupted, will teardown occur that will reset these feature flags? (i would expect that the function being called by
CleanupOnInterrupt
would do something to reset these?)
In any case it's probably outside of the scope of this PR to do anything about these but i want to make sure that if these need to be addressed we do something to track that - BUT if these are already addressed, we should at least document how this works somewhere (maybe in a docstring for the setup
function?) - or if there are already docs for this feel free to just point me at them 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that is the tricky part of the tests.
- This setup function doesn't set the feature flag, it will only check if the needed flags are enabled or not, if not it will skip the current test. So here since we require alpha feature flag, the test will only run in alpha tests.
c, namespace := setup(ctx, t, requireAnyGate(neededFeatureFlags))
- In the current PR we turn on
resource-verification-mode
by making api request to the api server and will turn it off after the test finished (One issue is that we don't have the secret lister from client side, so there might be some latency between making the call and it really happen).
Though it seems working now since the CI passes and we're not running them in parallel (no-parallel
).
Line 83 in 55459dd
go_test_e2e -timeout=${E2E_GO_TEST_TIMEOUT} ./test/... || failed=1
I can add comment here that this would be trouble if we run tests in parallel.
I cannot think of a good solution here since this might be the first case that one feature flag may break others. It may not be worth to add a new CI job at this moment though it can be a way to fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This setup function doesn't set the feature flag, it will only check if the needed flags are enabled or not, if not it will skip the current test.
ah, kk, thanks for explaining!
I can add comment here that this would be trouble if we run tests in parallel.
kk sounds good - it would be nice if we could actually fail the test if someone tries to do it (if this went wrong, tests might just start randomly failing and it might be hard to trace that back to this comment) but I'm not sure how to do that (probably possible tho?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be not so hard to trace, since the error message will tell that the error is because we enable this feature flag. I think maybe I should raise this to the community and seek more feedback?
test/trusted_resources_test.go
Outdated
t.Fatalf("Failed to create PipelineRun `%s`: %s", pr.Name, err) | ||
} | ||
|
||
if tc.wantErr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend avoiding using the same test code to test error and success cases. it's a Tekton wide standard at https://github.com/tektoncd/community/blob/main/standards.md#tests - would be nice if we included some of the reasoning for this in the docs itself but in the meantime you can see it in the discussion here tektoncd/community#133 (comment) - TL;DR: it makes code confusing and brittle and is much clearer and easier to maintain as separate test cases (even within this test to reason about what it's doing you have to understand that there are multiple if tc.wantErr
blocks that completely change the test funcitonality)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! I started with having error and success cases separately but merge them to reduce the redundant code. I will divide them again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will keep this standard in mind!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this in latest commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much! (If there is code it makes sense for them to share, you could always consider creating a library for that code - but also a little repetition is often not as bad as it seems - 'repetition is better than the wrong abstraction' :D (https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction)
Yeah this is great suggestion! I thought about this when writing the pipeline test. I think the pipeline case with a taskref should be enough since it will also check if the task can pass the verification. They are all duplicate tests from the unit tests. 😅 Yeah I think we can only keep the Pipeline success and error which pipeline or task is tampered with. |
e00337c
to
e40510e
Compare
The following is the coverage report on the affected files.
|
e40510e
to
dcf6e5b
Compare
The following is the coverage report on the affected files.
|
test/trusted_resources_test.go
Outdated
|
||
var ( | ||
neededFeatureFlags = map[string]string{ | ||
"resource-verification-mode": "enforce", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if i'm understanding how requireAnyGate
works correctly, I think there might be something odd about including this flag here? it seems like requireAnyGate
will run the current test if any of the passed in flags are set - since this test is providing both resource-verification-mode
and enable-api-fields
, am I right that that means the test skipping/running would be:
resource-verification-mode
set,enable-api-fields
, set <-- run the testresource-verification-mode
unset (which is what we'd expect i think),enable-api-fields
, set <-- run the test
3.resource-verification-mode
set,enable-api-fields
, unset <-- run the test
4.resource-verification-mode
unset,enable-api-fields
, unset <-- skip the test
i.e. it would only skip the test in case (4)? i think we'd want to skip the test in case (3) as well? long story short I'm wondering if we should only pass enable-api-fields
to requireAnyGate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're totally correct! It will run if any of these 2 are set. It totally makes sense that we only pass enable-api-fields
, I will change it accordingly.
test/trusted_resources_test.go
Outdated
t.Fatalf("Failed to create PipelineRun `%s`: %s", pr.Name, err) | ||
} | ||
|
||
if tc.wantErr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much! (If there is code it makes sense for them to share, you could always consider creating a library for that code - but also a little repetition is often not as bad as it seems - 'repetition is better than the wrong abstraction' :D (https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! A few more comments - and apologies for bringing up a whole new thing around wantErr
🙏
I think the pipeline case with a taskref should be enough since it will also check if the task can pass the verification.
ah good call, that makes a lot of sense
Yeah I think we can only keep the Pipeline success and error which pipeline or task is tampered with.
great!! :D
Namespace: namespace, | ||
Annotations: map[string]string{"foo": "bar"}, | ||
}, | ||
wantErr: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apologies, I gave you some feedback about wantErr
in the context of the e2e tests but that would apply to the rest of these tests as well - I strongly recommend separating out the error cases into their own tests for improved readability and maintainability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure! Thanks for reminding me of those tests. Will do!
test/trusted_resources_test.go
Outdated
ctx := context.Background() | ||
ctx, cancel := context.WithCancel(ctx) | ||
defer cancel() | ||
c, namespace := setup(ctx, t, requireAnyGate(neededFeatureFlags)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This setup function doesn't set the feature flag, it will only check if the needed flags are enabled or not, if not it will skip the current test.
ah, kk, thanks for explaining!
I can add comment here that this would be trouble if we run tests in parallel.
kk sounds good - it would be nice if we could actually fail the test if someone tries to do it (if this went wrong, tests might just start randomly failing and it might be hard to trace that back to this comment) but I'm not sure how to do that (probably possible tho?)
test/trusted_resources_test.go
Outdated
} | ||
|
||
knativetest.CleanupOnInterrupt(func() { tearDown(ctx, t, c, namespace) }, t.Logf) | ||
defer tearDown(ctx, t, c, namespace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just triple checking - will resource-verification-mode
be reset back to it's original value after the test even if the test is interrupted or fails? (i.e. does teardown reset the value?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I didn't put the cleanup code inside the tearDown, I feel like tearDown is mainly used to clean the namespace we created for this test. Our current test need to reset the configmap at system namespace, so to address your comment I create a new helperfunction within this test to clean up the created secret and configmap.
The defer will enable the function to be executed even if the test fails. I have tested this behaviour.
Thank you very much for pointing this out!!
f18e823
to
939c4e5
Compare
The following is the coverage report on the affected files.
|
This commit is part of TEP-0091, before this commit we only have signing and verification functions but not used. This commit adds verification at reconciler after remote resolution and local resolution is done.
939c4e5
to
20c6c53
Compare
The following is the coverage report on the affected files.
|
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the revisions!! The e2e test cases are looking great, I really like the simplicity of having just the 2 cases (one success, one error) 🎉
/lgtm
|
||
c, namespace, secretName, signer := setupResourceVerificationConfig(ctx, t, requireAnyGate(neededFeatureFlags)) | ||
knativetest.CleanupOnInterrupt(func() { removeResourceVerificationConfig(ctx, t, c, namespace, secretName) }, t.Logf) | ||
defer removeResourceVerificationConfig(ctx, t, c, namespace, secretName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!! :D :D :D thanks for all the back and forth on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot!!!
Changes
This commit is part of TEP-0091, before this commit we only have signing and verification functions but not used. This commit adds verification at reconciler after remote resolution and local resolve is done.
There are mainly 2 parts in this PR:
pkg/apis/config
pkg/reconciler
Notes that KMS key is not supported in this commit since the dependency is not fully imported. This commit supports read pem file to public key.
/kind feature
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
functionality, content, code)
/kind <type>
. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tepRelease Notes